Utility-Based Control Feedback in a Digital Library Search Engine: Cases in CiteSeerX

Authors

  • Jian Wu
  • Alexander Ororbia
  • Kyle Williams
  • Madian Khabsa
  • Zhaohui Wu
  • C. Lee Giles
Abstract

We describe a utility-based feedback control model and its applications within the open-access digital library search engine CiteSeerX, the new version of CiteSeer. CiteSeerX leverages user-based feedback to correct metadata and reformulate the citation graph. New documents are automatically crawled by a focused crawler for indexing. The URLs of ingested documents are automatically inspected to provide feedback to a whitelist filter, which automatically selects high-quality crawl seed URLs. Changes in a paper's citation count, together with its download history, indicate ill-conditioned metadata that needs correction. We believe that these feedback mechanisms effectively improve the overall metadata quality and save computational resources. Although these mechanisms are used in the context of CiteSeerX, we believe they can be readily transferred to other similar systems.
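
The whitelist feedback loop mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `update_whitelist`, the ingestion-ratio score, and the thresholds `min_ratio` and `min_crawled` are all assumptions chosen for illustration, standing in for whatever utility measure the paper actually defines. The idea shown is only that hosts whose crawled documents are frequently ingested stay on the seed whitelist.

```python
from collections import Counter
from urllib.parse import urlparse


def update_whitelist(crawled_urls, ingested_urls, min_ratio=0.5, min_crawled=2):
    """Return hosts whose crawled documents were ingested often enough.

    Hypothetical scoring: ratio of ingested to crawled documents per host.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    crawled = Counter(urlparse(u).netloc for u in crawled_urls)
    ingested = Counter(urlparse(u).netloc for u in ingested_urls)

    whitelist = set()
    for host, n_crawled in crawled.items():
        # Skip hosts with too few crawled documents to judge reliably.
        if n_crawled < min_crawled:
            continue
        # Counter returns 0 for hosts that never produced an ingested document.
        if ingested[host] / n_crawled >= min_ratio:
            whitelist.add(host)
    return whitelist


crawled = [
    "http://a.edu/p1.pdf", "http://a.edu/p2.pdf",
    "http://b.com/x.pdf", "http://b.com/y.pdf", "http://b.com/z.pdf",
    "http://c.org/1.pdf",
]
ingested = ["http://a.edu/p1.pdf", "http://a.edu/p2.pdf", "http://b.com/x.pdf"]
seed_hosts = update_whitelist(crawled, ingested)
```

In this toy input, only the host whose documents were all ingested passes the filter; a real system would likely combine several signals rather than a single ratio.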

Similar resources

CiteSeerX: AI in a Digital Library Search Engine

CiteSeerX is a digital library search engine that provides access to more than 4 million academic documents, with nearly a million users and millions of hits per day. Artificial intelligence (AI) technologies are used in many components of CiteSeerX, e.g., to accurately extract metadata, intelligently crawl the web, and ingest documents. We present key AI technologies used in the following compone...

CiteSeerx in the Cloud

Information retrieval applications are good candidates for hosting in a cloud infrastructure. CiteSeerx, a digital library and search engine, was built with the goal of efficiently disseminating scientific information and literature over the web. The framework for CiteSeerx, an application of the SeerSuite software, was designed with a focus on extensibility and scalability. Its loosely cou...

CiteSeerx: A Cloud Perspective

Information retrieval applications are good candidates for hosting in a cloud infrastructure. CiteSeerx, a digital library and search engine, was built with the goal of efficiently disseminating scientific information and literature over the web. The framework for CiteSeerx, as an application of the SeerSuite software, is designed with extensibility and scalability as fundamental features. Th...

Graph-based Approach to Automatic Taxonomy Generation (GraBTax)

We propose a novel graph-based approach for constructing a concept hierarchy from a large text corpus. Our algorithm, GraBTax, incorporates both statistical co-occurrences and lexical similarity in optimizing the structure of the taxonomy. To automatically generate topic-dependent taxonomies from a large text corpus, GraBTax first extracts topical terms and their relationships from the corpus. Th...

Scalability Bottlenecks of the CiteSeerX Digital Library Search Engine

As the document collection and user population increase, the capability and performance of a digital library such as CiteSeerX may be limited by some bottlenecks. This paper describes the current infrastructure of the CiteSeerX academic digital library search engine, outlines its current bottlenecks, and proposes feasible solutions. These bottlenecks exist in various components of the system incl...

Publication date: 2014